Predict the Rating for Board Games¶

This data set contains 80000 board games, with information about each game and its associated review scores. I'm going to predict average_rating using the other columns.

In [2]:

import pandas as pd
board_games = pd.read_csv("board_games.csv")
board_games.head()

Out[2]:

	id	type	name	yearpublished	minplayers	maxplayers	playingtime	minplaytime	maxplaytime	minage	users_rated	average_rating	bayes_average_rating	total_owners	total_traders	total_wanters	total_wishers	total_comments	total_weights	average_weight
0	12333	boardgame	Twilight Struggle	2005.0	2.0	2.0	180.0	180.0	180.0	13.0	20113	8.33774	8.22186	26647	372	1219	5865	5347	2562	3.4785
1	120677	boardgame	Terra Mystica	2012.0	2.0	5.0	150.0	60.0	150.0	12.0	14383	8.28798	8.14232	16519	132	1586	6277	2526	1423	3.8939
2	102794	boardgame	Caverna: The Cave Farmers	2013.0	1.0	7.0	210.0	30.0	210.0	12.0	9262	8.28994	8.06886	12230	99	1476	5600	1700	777	3.7761
3	25613	boardgame	Through the Ages: A Story of Civilization	2006.0	2.0	4.0	240.0	240.0	240.0	12.0	13294	8.20407	8.05804	14343	362	1084	5075	3378	1642	4.1590
4	3076	boardgame	Puerto Rico	2002.0	2.0	5.0	150.0	90.0	150.0	12.0	39883	8.14261	8.04524	44362	795	861	5414	9173	5213	3.2943

Cleaning¶

In [3]:

board_games.dropna(axis=0, inplace = True)
board_games = board_games[board_games['users_rated'] > 0]

Data Exploration¶

In [4]:

%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(board_games['average_rating'])
plt.show()
plt.boxplot(board_games['average_rating'])
plt.show()

std = board_games['average_rating'].std()
mean = board_games['average_rating'].mean()

print(std)
print(mean)

1.57882993483
6.01611284933

Error Metric¶

The distribution of average_rating looks roughly normal, so we can use mean squared error as the error metric.
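To make the metric concrete, here is a minimal sketch of mean squared error and its square root (RMSE) using NumPy. The helper names `mse` and `rmse` and the three ratings are made up for illustration and are not part of the data set.

```python
import numpy as np

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

def rmse(actual, predicted):
    """Square root of the MSE, expressed in the same units as the ratings."""
    return mse(actual, predicted) ** 0.5

# Hypothetical ratings, not taken from board_games.csv
actual = [6.0, 7.0, 8.0]
predicted = [5.0, 7.0, 10.0]

example_mse = mse(actual, predicted)    # (1 + 0 + 4) / 3
example_rmse = rmse(actual, predicted)
print(example_mse, example_rmse)
```

Because the errors are squared, large misses are penalized more heavily than small ones. RMSE is often easier to read than MSE because it is on the same scale as the ratings.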

Clustering¶

In [5]:

from sklearn.cluster import KMeans

kmeans_model = KMeans(n_clusters = 5, random_state=1)
numeric_columns = board_games.iloc[:,3:]
kmeans_model.fit(numeric_columns)
labels = kmeans_model.labels_

import numpy
game_mean = numeric_columns.apply(numpy.mean, axis=1)
game_std = numeric_columns.apply(numpy.std, axis=1)
    
plt.scatter(x = game_mean, y = game_std, c = labels)
plt.show()

It looks like most of the games are similar: 4 of the 5 clusters lie between mean = 0 and mean = 4000.

Finding Correlations¶

Remove columns that don't add predictive power to the model.

In [6]:

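# Restrict to numeric columns; recent pandas versions raise an error on text columns such as 'name'
correlations = board_games.select_dtypes(include="number").corr()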
print(correlations['average_rating'])

id                      0.304201
yearpublished           0.108461
minplayers             -0.032701
maxplayers             -0.008335
playingtime             0.048994
minplaytime             0.043985
maxplaytime             0.048994
minage                  0.210049
users_rated             0.112564
average_rating          1.000000
bayes_average_rating    0.231563
total_owners            0.137478
total_traders           0.119452
total_wanters           0.196566
total_wishers           0.171375
total_comments          0.123714
total_weights           0.109691
average_weight          0.351081
Name: average_rating, dtype: float64

The 'yearpublished' column is, somewhat surprisingly, positively correlated with average_rating (0.11), so more recent games tend to be rated slightly higher.
The higher the 'minage', the higher the rating tends to be (0.21).
The more "weighty" a game is (average_weight, its complexity rating), the more highly it tends to be rated (0.35).
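When reading a list of correlations, it can help to rank the columns by the absolute value of the coefficient, since a strong negative correlation is as informative as a strong positive one. The sketch below does this on a small made-up frame; the `toy` data and the `noise` column are assumptions for illustration only, not values from the data set.

```python
import pandas as pd

# Hypothetical data: one strongly positive, one strongly negative, one unrelated column
toy = pd.DataFrame({
    "average_rating": [5.0, 6.0, 7.0, 8.0, 9.0],
    "average_weight": [1.0, 2.0, 3.0, 4.0, 5.0],
    "minplayers":     [4.0, 3.0, 3.0, 2.0, 1.0],
    "noise":          [2.0, 5.0, 1.0, 5.0, 2.0],
})

# Pearson correlation of every column with the target, excluding the target itself
target_corr = toy.corr()["average_rating"].drop("average_rating")

# Order by strength of the relationship, keeping the sign for interpretation
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

The same two lines at the end could be applied to `correlations['average_rating']` from the cell above.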

In [25]:

cols = list(board_games.columns)
cols.remove("average_rating")
cols.remove("bayes_average_rating")
cols.remove("minplayers")
cols.remove("maxplayers")
# identifier and non-numeric columns
cols.remove("name")
cols.remove("id")
cols.remove("type")

I removed columns that are not useful as predictors, such as 'bayes_average_rating', which is derived from 'average_rating'.

Linear Regression¶

In [28]:

from sklearn.linear_model import LinearRegression

# Training
lr = LinearRegression()
lr.fit(board_games[cols], board_games["average_rating"])

# Prediction
predictions = lr.predict(board_games[cols])

from sklearn.metrics import mean_squared_error
import math

mse = mean_squared_error(board_games['average_rating'], predictions)
rmse = math.sqrt(mse)

print(rmse)

1.4479383303003244

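The RMSE (1.45) is close to the standard deviation (1.58) of all board game ratings. This indicates that our model may not have high predictive power. Note that the error is computed on the same rows the model was trained on, so it is likely an optimistic estimate.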
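One way to put an RMSE in context is to compare it with a baseline that always predicts the mean rating: on the data it was computed from, that baseline's RMSE equals the (population) standard deviation. The sketch below illustrates this, and also evaluates on a held-out split, using synthetic data. The generated features, the coefficient, and the 80/20 split are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the board game data (not the real data set)
rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))
y = 6.0 + 1.0 * X[:, 0] + rng.normal(scale=1.0, size=n)

# Predicting the mean everywhere gives an RMSE equal to the population std
mean_only_rmse = np.sqrt(np.mean((y - y.mean()) ** 2))

# Hold out 20% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = LinearRegression().fit(X_train, y_train)
model_rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

# Baseline on the same held-out rows: always predict the training mean
baseline = np.full_like(y_test, y_train.mean())
baseline_rmse = mean_squared_error(y_test, baseline) ** 0.5

print(model_rmse, baseline_rmse)
```

If a model's held-out RMSE is only slightly below the baseline's, the features explain little of the variation in the target.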